
[COPP] Fixed COPP test - reload logic #5651

Closed
wants to merge 1 commit into from

Conversation

ppikh
Contributor

@ppikh ppikh commented May 13, 2022

Signed-off-by: Petro Pikh [email protected]

Description of PR

Fixed COPP test - reload logic
The old logic caused occasional failures in the first test in the class due to traffic loss (it looks like the DUT is not ready after the reload). The issue was introduced in PR #5564.

Summary: Fixed COPP test - reload logic
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911
  • 202012

Approach

What is the motivation for this PR?

Fix test failure

How did you do it?

See code

How did you verify/test it?

Executed tests from file sonic-mgmt/tests/copp/test_copp.py

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

The old logic caused occasional failures in the first test in the class due to traffic loss

Signed-off-by: Petro Pikh <[email protected]>
@ppikh ppikh requested a review from a team as a code owner May 13, 2022 14:19
@liat-grozovik
Collaborator

/azp run Azure.sonic-mgmt

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@liat-grozovik
Collaborator

@dgsudharsan could you please help to review?

@yejianquan
Collaborator

Hi @liat-grozovik, I don't understand this:
if the problem is that the DUT is not ready after config_reload, the safe_reload flag makes the DUT wait until the critical services have fully started.
Why does removing it help to solve the issue?

@ppikh
Contributor Author

ppikh commented May 16, 2022

Hi @yejianquan
If we use safe_reload, the first test case fails from time to time on some DUTs.
It looks to me like the DUT is not ready, but we start sending traffic anyway, and as a result the traffic check fails.
When I removed the "safe_reload" argument, the tests pass.

@yejianquan
Collaborator

> Hi @yejianquan If we use safe_reload - time to time on some DUTs we have failed first test case. For me it looks like DUT is not ready but we started to send traffic - as result traffic failed. When I removed argument "safe_reload" - tests passing.

Hi @ppikh, can you paste the error log?
The safe_reload flag runs well on our physical testbeds.
There's a PR merged recently for a test_copp error-log issue; I'm not sure whether it's related to your case:
#5642

@ppikh
Contributor Author

ppikh commented May 16, 2022

Hi @yejianquan

Please see error log below

ERROR    tests.ptf_runner:ptf_runner.py:63 Exception caught while executing case: copp_tests.ARPTest. Error message: Traceback (most recent call last):
  File "/root/mars/workspace/sonic-mgmt/tests/ptf_runner.py", line 56, in ptf_runner
    result = host.shell(cmd, chdir="/root", module_ignore_errors=module_ignore_errors)
  File "/root/mars/workspace/sonic-mgmt/tests/common/devices/base.py", line 89, in _run
    raise RunAnsibleModuleFail("run module {} failed".format(self.module_name), res)
RunAnsibleModuleFail: run module shell failed, Ansible Results =>
{
    "changed": true, 
    "cmd": "ptf --test-dir ptftests copp_tests.ARPTest --qlen=100000 --platform nn -t 'peerip=u'\"'\"'10.0.0.22'\"'\"';verbose=False;myip=u'\"'\"'10.0.0.23'\"'\"';send_rate_limit=2000;target_port=11;has_trap=True' --device-socket 0-11@tcp://127.0.0.1:10900  --device-socket 1-11@tcp://10.210.24.22:10900", 
    "delta": "0:00:30.399052", 
    "end": "2022-05-13 03:12:35.341899", 
    "failed": true, 
    "invocation": {
        "module_args": {
            "_raw_params": "ptf --test-dir ptftests copp_tests.ARPTest --qlen=100000 --platform nn -t 'peerip=u'\"'\"'10.0.0.22'\"'\"';verbose=False;myip=u'\"'\"'10.0.0.23'\"'\"';send_rate_limit=2000;target_port=11;has_trap=True' --device-socket 0-11@tcp://127.0.0.1:10900  --device-socket 1-11@tcp://10.210.24.22:10900", 
            "_uses_shell": true, 
            "argv": null, 
            "chdir": "/root", 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "stdin": null, 
            "stdin_add_newline": true, 
            "strip_empty_ends": true, 
            "warn": true
        }
    }, 
    "msg": "non-zero return code", 
    "rc": 1, 
    "start": "2022-05-13 03:12:04.942847", 
    "stderr": "/usr/local/lib/python2.7/dist-packages/paramiko/transport.py:33: CryptographyDeprecationWarning: Python 2 is no longer supported by the Python core team. Support for it is now deprecated in cryptography, and will be removed in the next release.\n  from cryptography.hazmat.backends import default_backend\ncopp_tests.ARPTest ... FAIL\n\n======================================================================\nFAIL: copp_tests.ARPTest\n----------------------------------------------------------------------\nTraceback (most recent call last):\n  File \"ptftests/copp_tests.py\", line 267, in runTest\n    self.run_suite()\n  File \"ptftests/copp_tests.py\", line 201, in run_suite\n    self.one_port_test(self.target_port)\n  File \"ptftests/copp_tests.py\", line 196, in one_port_test\n    self.check_constraints(send_count, recv_count, time_delta_ms, rx_pps)\n  File \"ptftests/copp_tests.py\", line 257, in check_constraints\n    assert self.PPS_LIMIT_MIN <= rx_pps <= self.PPS_LIMIT_MAX, \"rx_pps {}\".format(rx_pps)\nAssertionError: rx_pps 0\n\n----------------------------------------------------------------------\nRan 1 test in 29.078s\n\nFAILED (failures=1)", 
    "stderr_lines": [
        "/usr/local/lib/python2.7/dist-packages/paramiko/transport.py:33: CryptographyDeprecationWarning: Python 2 is no longer supported by the Python core team. Support for it is now deprecated in cryptography, and will be removed in the next release.", 
        "  from cryptography.hazmat.backends import default_backend", 
        "copp_tests.ARPTest ... FAIL", 
        "", 
        "======================================================================", 
        "FAIL: copp_tests.ARPTest", 
        "----------------------------------------------------------------------", 
        "Traceback (most recent call last):", 
        "  File \"ptftests/copp_tests.py\", line 267, in runTest", 
        "    self.run_suite()", 
        "  File \"ptftests/copp_tests.py\", line 201, in run_suite", 
        "    self.one_port_test(self.target_port)", 
        "  File \"ptftests/copp_tests.py\", line 196, in one_port_test", 
        "    self.check_constraints(send_count, recv_count, time_delta_ms, rx_pps)", 
        "  File \"ptftests/copp_tests.py\", line 257, in check_constraints", 
        "    assert self.PPS_LIMIT_MIN <= rx_pps <= self.PPS_LIMIT_MAX, \"rx_pps {}\".format(rx_pps)", 
        "AssertionError: rx_pps 0", 
        "", 
        "----------------------------------------------------------------------", 
        "Ran 1 test in 29.078s", 
        "", 
        "FAILED (failures=1)"
    ], 
    "stdout": "", 
    "stdout_lines": []
}

ERROR    root:__init__.py:40 Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/_pytest/python.py", line 1464, in runtest
    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
  File "/usr/local/lib/python2.7/dist-packages/pluggy/hooks.py", line 286, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pluggy/manager.py", line 93, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pluggy/manager.py", line 87, in <lambda>
    firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
  File "/usr/local/lib/python2.7/dist-packages/pluggy/callers.py", line 208, in _multicall
    return outcome.get_result()
  File "/usr/local/lib/python2.7/dist-packages/pluggy/callers.py", line 81, in get_result
    _reraise(*ex)  # noqa
  File "/usr/local/lib/python2.7/dist-packages/pluggy/callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "/usr/local/lib/python2.7/dist-packages/_pytest/python.py", line 174, in pytest_pyfunc_call
    testfunction(**testargs)
  File "/root/mars/workspace/sonic-mgmt/tests/copp/test_copp.py", line 89, in test_policer
    dut_type)
  File "/root/mars/workspace/sonic-mgmt/tests/copp/test_copp.py", line 287, in _copp_runner
    device_sockets=device_sockets)
  File "/root/mars/workspace/sonic-mgmt/tests/ptf_runner.py", line 64, in ptf_runner
    raise Exception
Exception
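The failing assertion in the log above ("AssertionError: rx_pps 0") can be reconstructed as a toy example. The real implementation lives in ptftests/copp_tests.py; the function shape here mirrors the traceback, but the rate limits are illustrative values only, not the test's actual thresholds:

```python
# Toy reconstruction of the failing check; limits are illustrative only.
def check_constraints(recv_count, time_delta_ms,
                      pps_limit_min=500, pps_limit_max=2000):
    # Received packets per second over the measurement window.
    rx_pps = int(recv_count / (time_delta_ms / 1000.0)) if time_delta_ms else 0
    # If the DUT was not ready when traffic was sent, recv_count is 0,
    # so rx_pps is 0 and this assertion fires exactly as in the log.
    assert pps_limit_min <= rx_pps <= pps_limit_max, "rx_pps {}".format(rx_pps)
    return rx_pps
```

This is why a DUT that silently drops all traffic right after reload shows up as `rx_pps 0` rather than a connection error: packets are sent, nothing comes back, and the lower PPS bound trips.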

@yejianquan
Collaborator

Hi @saiarcot895
I checked the previous test results from April (the safe_reload flag was introduced at the end of April); test_copp worked fine without safe_reload=True in the past.
Did you find any failed test_copp runs indicating that test_copp needs safe_reload=True?
If not, I prefer to merge this PR and observe the test results going forward.
But I still don't understand it; safe_reload should only make things better.

@saiarcot895
Contributor

Hi @yejianquan, I added safe_reload here not because anything was failing, but as a preventative measure, to make sure that everything was up and running before proceeding.

One issue I can see is that with safe reload enabled, instead of waiting for a fixed 120 seconds (the old behavior), the test may wait anywhere between 0 and 420 seconds (the 120 seconds plus a 300-second buffer for services to come up), which means it may end up waiting for less time if the services happen to be up sooner. However, if swss isn't ready when the services are reported to be up, maybe there needs to be some other check to make sure everything is ready and any ASIC programming is done.
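The two wait strategies described above can be sketched minimally. The names and numbers mirror this comment (120-second fixed wait, 300-second buffer); nothing here is a real sonic-mgmt API:

```python
# Sketch of the two post-reload wait strategies; illustrative names only.
FIXED_WAIT_S = 120          # old behavior: unconditional sleep after reload
SAFE_RELOAD_BUFFER_S = 300  # extra budget for critical services to come up

def reload_wait_bounds(safe_reload):
    """Return the (min, max) seconds the test may wait after config reload."""
    if safe_reload:
        # safe_reload polls and returns as soon as services report up,
        # so the actual wait can land anywhere in this range.
        return (0, FIXED_WAIT_S + SAFE_RELOAD_BUFFER_S)
    # Without safe_reload, the test always sleeps the full fixed interval.
    return (FIXED_WAIT_S, FIXED_WAIT_S)
```

The failure mode being discussed is the lower bound: returning as soon as services report up can mean proceeding before the ASIC is actually programmed.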

For now, you can go ahead and merge this PR.

@sanmalho-git
Contributor

FYI - we have PR #5610 open to add a check that the interfaces are up, as an optional action as part of config reload.
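The kind of optional check described here can be sketched as a bounded poll. `get_oper_status`, `sleep`, and `clock` are injected purely so the sketch is self-contained and testable; they are hypothetical, not sonic-mgmt APIs:

```python
# Hedged sketch of an interfaces-up wait; injected callables are hypothetical.
import time

def wait_interfaces_up(interfaces, get_oper_status, timeout_s=300, poll_s=5,
                       sleep=time.sleep, clock=time.monotonic):
    """Return True once every interface is oper-up, False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if all(get_oper_status(intf) == "up" for intf in interfaces):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

Compared with a fixed sleep, this returns as soon as the condition holds but still bounds the worst case, which is the trade-off discussed throughout this thread.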

@saiarcot895
Contributor

If that PR works consistently, then maybe that PR can be merged, and this one can be closed?

@liat-grozovik
Collaborator

@saiarcot895, do you mean to close this PR and keep #5610?
The other one is a bigger change and does not seem to be approved yet.
I suggest that, in order to have a working test, we take this PR first; any additional improvements as suggested by #5610 can be done on top of it. What do you say?

@liat-grozovik
Collaborator

@saiarcot895 kind reminder

@saiarcot895
Contributor

I reviewed and merged #5610, as I feel it might fix the issue seen here. Could you test with those changes and see if the issue is still present?

@ppikh
Contributor Author

ppikh commented May 31, 2022

Hi @saiarcot895 - it looks like PR #5610 fixed the issue (I ran the tests a few times and the issue does not reproduce).
I think this PR can be closed.

@ppikh ppikh closed this May 31, 2022